Abstract

Trace norm regularization is a popular method of multitask learning. 
We give excess risk bounds with explicit dependence on the number of 
tasks, the number of examples per task and properties of the data distri- 
bution. The bounds are independent of the dimension of the input space, 
which may be infinite as in the case of reproducing kernel Hilbert spaces. 
A byproduct of the proof are bounds on the expected norm of sums of 
random positive semidefinite matrices with subexponential moments. 

1 Introduction 

A fundamental limitation of supervised learning is the cost incurred by the 
preparation of the large training samples required for good generalization. A 
potential remedy is offered by multi-task learning: in many cases, while individ- 
ual sample sizes are rather small, there are samples to represent a large number 
of learning tasks, which share some constraining or generative property. This 
common property can be estimated using the entire collection of training sam- 
ples, and if this property is sufficiently simple it should allow better estimation 
of the individual tasks from small individual samples. 

The machine learning community has tried multi-task learning for many 
years (see [3J 21 HH EH HH [201 1211 HI] , contributions and references therein) , but 
there are few theoretical investigations which clearly expose the conditions under 
which multi-task learning is preferable to independent learning. Following the 
seminal work of Baxter (0 [8]) several authors have given generalization and 
performance bounds under different assumptions of task-relatedness. In this
paper we consider multi-task learning with trace-norm regularization (TNML) , 
a technique for which efficient algorithms exist and which has been successfully 
applied many times (see e.g. [H HI Q31 ITS]). 

In the learning framework considered here the inputs live in a separable
Hilbert space $H$, which may be finite or infinite dimensional, and the outputs
are real numbers. For each of $T$ tasks an unknown input-output relationship
is modeled by a distribution $\mu_t$ on $H \times \mathbb{R}$, with $\mu_t(X, Y)$ being interpreted as
the probability of observing the input-output pair $(X, Y)$. We assume bounded
inputs, for simplicity $\|X\| \le 1$, where we use $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ to denote the Euclidean
norm and inner product in $H$ respectively.

A predictor is specified by a weight vector $w \in H$ which predicts the output
$\langle w, x \rangle$ for an observed input $x \in H$. If the observed output is $y$ a loss $\ell(\langle w, x \rangle, y)$
is incurred, where $\ell$ is a fixed loss function on $\mathbb{R}^2$, assumed to have values in
$[0,1]$, with $\ell(\cdot, y)$ being Lipschitz with constant $L$ for each $y \in \mathbb{R}$. The expected
loss or risk of weight vector $w$ in the context of task $t$ is thus

\[ R_t(w) = \mathbb{E}_{(X,Y) \sim \mu_t} \left[ \ell(\langle w, X \rangle, Y) \right]. \]

The choice of a weight vector $w_t$ for each task $t$ is equivalent to the choice of a
linear map $W : H \to \mathbb{R}^T$, with $(Wx)_t = \langle x, w_t \rangle$. We seek to choose $W$ so as to
(nearly) minimize the total average risk $R(W)$ defined by

\[ R(W) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(X,Y) \sim \mu_t} \left[ \ell(\langle w_t, X \rangle, Y) \right]. \]

Since the $\mu_t$ are unknown, the minimization is based on a finite sample of
observations, which for each task $t$ is modelled by a vector $Z^t$ of $n$ independent
random variables $Z^t = (Z_1^t, \dots, Z_n^t)$, where each $Z_i^t = (X_i^t, Y_i^t)$ is distributed
according to $\mu_t$. For most of this paper we make the simplifying assumption that
all the samples have the same size $n$. With an appropriate modification of the
algorithm defined below this assumption can be removed (see Remark 7 below).
In a similar way the definition of $R(W)$ can be replaced by a weighted average
which attributes greater weight to tasks which are considered more important.
The entire multi-sample $(Z^1, \dots, Z^T)$ is denoted by $Z$.

A classical and intuitive learning strategy is empirical risk minimization.
One decides on a constraint set $\mathcal{W} \subseteq \mathcal{L}(H, \mathbb{R}^T)$ for candidate maps and solves
the problem

\[ \hat W(Z) = \arg\min_{W \in \mathcal{W}} \hat R(W, Z), \]

where the average empirical risk $\hat R(W, Z)$ is defined as

\[ \hat R(W, Z) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(\langle w_t, X_i^t \rangle, Y_i^t). \]

If the candidate set $\mathcal{W}$ has the form $\mathcal{W} = \{x \mapsto Wx : (Wx)_t = \langle x, w_t \rangle,\ w_t \in \mathcal{B}\}$
where $\mathcal{B} \subseteq H$ is some candidate set of vectors, then this is equivalent to single
task learning, solving for each task the problem

\[ \hat w_t(Z^t) = \arg\min_{w \in \mathcal{B}} \frac{1}{n} \sum_{i=1}^{n} \ell(\langle w, X_i^t \rangle, Y_i^t). \]

For proper multi-task learning the set $\mathcal{W}$ is chosen such that for a map $W$
membership in $\mathcal{W}$ implies some mutual dependence between the vectors $w_t$.

A good candidate set $\mathcal{W}$ must fulfill two requirements: it must be large
enough to contain maps with low risk and small enough that we can find such
maps from a finite number of examples. The first requirement means that the
risk of the best map $W^*$ in the set,

\[ W^* = \arg\min_{W \in \mathcal{W}} R(W), \]

is small. This depends on the set of tasks at hand and is largely a matter of
domain knowledge. The second requirement is that the risk of the operator
which we find by empirical risk minimization, $\hat W(Z)$, is not too different from
the risk of $W^*$, so that the excess risk

\[ R(\hat W(Z)) - R(W^*) \]

is small. Bounds on this quantity are the subject of this paper, and, as $R(\hat W(Z))$
is a random variable, they can only be expected to hold with a certain probability.

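For multitask learning with trace-norm regularization (TNML) we suppose
that $\mathcal{W}$ is defined in terms of the trace-norm

\[ \mathcal{W} = \left\{ W \in \mathbb{R}^{d \times T} : \|W\|_1 \le B\sqrt{T} \right\}, \tag{1} \]

where $\|W\|_1 = \operatorname{tr}\left((W^*W)^{1/2}\right)$ and $B > 0$ is a regularization constant. The
factor $\sqrt{T}$ is an important normalization which we explain below. We will prove

Theorem 1 (i) For $\delta > 0$ with probability at least $1 - \delta$ in $Z$

\[ R(\hat W) - R(W^*) \le 2LB \left( \sqrt{\frac{\|C\|_\infty}{n}} + 5\sqrt{\frac{\ln(nT) + 1}{nT}} \right) + \sqrt{\frac{2\ln(2/\delta)}{nT}}, \]

where $\|\cdot\|_\infty$ is the operator, or spectral norm, and $C$ is the task averaged,
uncentered data covariance operator

\[ \langle Cv, w \rangle = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(X,Y) \sim \mu_t} \langle v, X \rangle \langle X, w \rangle, \quad \text{for } w, v \in H. \]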
t=i 
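(ii) Also with probability at least $1 - \delta$ in $Z$

\[ R(\hat W) - R(W^*) \le 2LB \left( \sqrt{\frac{\|\hat C\|_\infty}{n}} + \sqrt{\frac{2(\ln(nT) + 1)}{nT}} \right) + \sqrt{\frac{8\ln(3/\delta)}{nT}}, \]

with $\hat C$ being the task averaged, uncentered empirical covariance operator

\[ \langle \hat C v, w \rangle = \frac{1}{nT} \sum_{t=1}^{T} \sum_{i=1}^{n} \langle v, X_i^t \rangle \langle X_i^t, w \rangle, \quad \text{for } w, v \in H. \]
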
Remarks: 

1. The first bound is distribution dependent, the second data-dependent. 

2. Suppose that for an operator $W$ all $T$ column vectors $w_t$ are equal to a
common vector $w$, as might be the case if all the $T$ tasks are equivalent. In
this case increasing the number of tasks should not increase the regularizer.
Since then $\|W\|_1 = \sqrt{T}\,\|w\|$ we have chosen the factor $\sqrt{T}$ in (1). It allows
us to consider the limit $T \to \infty$ for a fixed value of $B$.

3. In the limit $T \to \infty$ the bounds become

\[ 2LB\sqrt{\frac{\|C\|_\infty}{n}} \quad \text{or} \quad 2LB\sqrt{\frac{\|\hat C\|_\infty}{n}} \]

respectively. The limit is finite and it is approached at a rate of $\sqrt{\ln(T)/T}$.

4. If the mixture of data distributions is supported on a one dimensional
subspace then $\|C\|_\infty = \mathbb{E}\|X\|^2$ and the bound is always worse than standard
bounds for single task learning as in [6]. The situation is similar if
the distribution is supported on a very low dimensional subspace. Thus,
if learning is already easy, TNML will bring no benefit.

5. If the mixture of data distributions is uniform on an $M$-dimensional unit
sphere in $H$ then $\|C\|_\infty = 1/M$ and the corresponding term in the bound
becomes small. Suppose now that for $W = (w_1, \dots, w_T)$ the $w_t$ all are
constrained to be unit vectors lying in some $K$-dimensional subspace of
$H$, as might be the solution returned by a method of subspace learning
[3]. If we choose $B = K^{1/2}$ then $W \in \mathcal{W}$, and our bound also applies.
This subspace corresponds to the property shared among the tasks.
The cost of its estimation vanishes in the limit $T \to \infty$ and the bound
becomes

\[ 2L\sqrt{\frac{K}{nM}}. \]

$K$ is proportional to the number of bits needed to communicate the utilized
component of an input vector, given knowledge of the common subspace.
$M$ is proportional to the number of bits to communicate an entire input
vector. In this sense the quantity $K/M$ can be interpreted as the ratio
of the utilized information $K$ to the available information $M$, as in [22].
If $T$ and $M$ are large and $K$ is small the excess risk can be very small
even for small sample sizes $n$. Thus, if learning is difficult (due to data of
intrinsically high dimension) and the approximation error is small, then
TNML is superior to single task learning.

6. An important example of the infinite dimensional case is given when $H$ is
the reproducing kernel Hilbert space $H_K$ generated by a positive semidefinite
kernel $K : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ where $\mathcal{Z}$ is a set of inputs. This setting is
important because it allows learning large classes of nonlinear functions.
By the representer theorem for matrix regularizers [5] empirical risk
minimization within the hypothesis space $\mathcal{W}$ reduces to a finite dimensional
problem in $nT^2$ variables.

7. The assumption of equal sample sizes for all tasks is often violated in
practice. Let $n_t$ be the number of examples available for the $t$-th task.
The resulting imbalance can be compensated by a modification of the
regularizer, replacing $\|W\|_1$ by a weighted trace norm $\|SW\|_1$, where the
diagonal matrix $S = \operatorname{diag}(s_1, \dots, s_T)$ weights the $t$-th task with

\[ s_t = \sqrt{\bar n / n_t}. \]

With this modification the Theorem holds with the average sample size
$\bar n = (1/T)\sum_t n_t$ in place of $n$. In Section 5 we will prove this result, which
then reduces to Theorem 1 when all the sample sizes are equal.
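
As a rough numerical illustration of Remark 2 and of the data-dependent bound in Theorem 1 (ii), the following sketch assumes finite dimensional inputs $H = \mathbb{R}^d$ and equal sample sizes. The helper names (`trace_norm`, `empirical_covariance_norm`, `data_dependent_bound`), the array layout and the choice of inputs on the unit sphere are hypothetical choices made for this illustration only; they are not part of the analysis above.

```python
import numpy as np


def trace_norm(W):
    """Trace norm of a matrix: the sum of its singular values."""
    return np.linalg.svd(W, compute_uv=False).sum()


def empirical_covariance_norm(X):
    """Spectral norm of the task averaged, uncentered empirical covariance.

    X has shape (T, n, d): T tasks, n examples per task, inputs in R^d.
    """
    T, n, d = X.shape
    flat = X.reshape(T * n, d)
    C_hat = flat.T @ flat / (n * T)
    return np.linalg.norm(C_hat, 2)


def data_dependent_bound(X, L, B, delta):
    """Right-hand side of Theorem 1 (ii), assuming equal sample sizes."""
    T, n, _ = X.shape
    complexity = (np.sqrt(empirical_covariance_norm(X) / n)
                  + np.sqrt(2.0 * (np.log(n * T) + 1.0) / (n * T)))
    confidence = np.sqrt(8.0 * np.log(3.0 / delta) / (n * T))
    return 2.0 * L * B * complexity + confidence


# Remark 2: T identical columns w give trace norm sqrt(T) * ||w||.
rng = np.random.default_rng(0)
d, T, n = 20, 50, 5
w = rng.standard_normal(d)
W_equal = np.tile(w[:, None], (1, T))

# Illustrative inputs on the unit sphere of R^d, so that ||X|| <= 1 holds.
X = rng.standard_normal((T, n, d))
X /= np.linalg.norm(X, axis=2, keepdims=True)
bound = data_dependent_bound(X, L=1.0, B=1.0, delta=0.05)
```

Here `trace_norm(W_equal)` should agree with `np.sqrt(T) * np.linalg.norm(w)` up to rounding, which is the reason for the normalization $\sqrt{T}$ in (1).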

The proof of Theorem 1 is based on the well established method of Rademacher
averages [5] and more recent advances on tail bounds for sums of random
matrices, drawing heavily on the work of Ahlswede and Winter [1], Oliveira [24]
and Tropp [27]. In this context two auxiliary results are established (Theorem
7 and Theorem 8 below), which may be of independent interest.

2 Earlier work. 

The foundations to a theoretical understanding of multi-task learning were laid 
by J. Baxter in [8], where covering numbers are used to expose the potential 
benefits of multi-task and transfer learning. In [3] Rademacher averages are used
to give excess risk bounds for a method of multi-task subspace learning. Similar 
results are obtained in [21] . [9] uses a special assumption of task-relatedness to 
give interesting bounds not on the average, but the maximal risk over the tasks. 

A lot of important work on trace norm regularization concerns matrix com- 
pletion, where a matrix is only partially observed and approximated (or under 
certain assumptions even reconstructed) by a matrix of small trace norm (see
e.g. [11], [25] and references therein). For $H = \mathbb{R}^d$ and $T \times d$-matrices, this
is somewhat related to the situation considered here, if we identify the tasks
with the columns of the matrix in question, the input marginal as the uniform
distribution supported on the basis vectors of $\mathbb{R}^d$ and the outputs as defined
by the matrix values themselves, without or with the addition of noise. One 
essential difference is that matrix completion deals with a known and particu- 
larly simple input distribution, which makes it unclear how bounds for matrix 
completion can be converted to bounds for multitask learning. On the other 
hand our bounds cannot be directly applied to matrix completion, because they 
assume a fixed number of revealed entries for each column. 

Multitask learning is considered in [20], where special assumptions
(coordinate-sparsity of the solution, restricted eigenvalues) are used to derive fast rates and
the recovery of shared features. Such assumptions are absent in this paper, and
[20] also considers a different regularizer.

[22] and [18] seem to be most closely related to the present work. In [22] the
general form of the bound is very similar to Theorem 1. The result is dimension
independent, but it falls short of giving the rate of $\sqrt{\ln(T)/T}$ in the number
of tasks. Instead it gives $T^{-1/4}$.

[18] introduces a general and elegant method to derive bounds for learning
techniques which employ matrix norms as regularizers. For $H = \mathbb{R}^d$ and
applied to multi task learning and the trace-norm a data-dependent bound is
given whose dominant term reads as (omitting constants and observing $\|W\|_1 \le B\sqrt{T}$)



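\[ LB\sqrt{\frac{\max_i \|\hat C_i\|_\infty \, \ln\min\{T, d\}}{n}}, \tag{2} \]

where the matrix $\hat C_i$ is the empirical covariance of the data for all tasks observed
in the $i$-th observation

\[ \hat C_i v = \frac{1}{T} \sum_{t=1}^{T} \langle v, X_i^t \rangle X_i^t. \]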



The bound (2) does not paint a clear picture of the role of the number of tasks
$T$. Using Theorem 8 below we can estimate its expectation and convert it into
the distribution dependent bound with dominant term



LB ^n{T, d] \ ^-r 2 ^ + V Irr ■ (3) 




This is quite similar to Theorem 1 (i). Because (2) is hinged on the $i$-th
observation it is unclear how it can be modified for unequal sample sizes for
different tasks. The principal disadvantage of (2) however is that it diverges in
the simultaneous limit $d, T \to \infty$.
3 Notation and Tools



The letters $H$, $H'$, $H''$ will denote finite or infinite dimensional separable real
Hilbert spaces.

For a linear map $A : H \to H'$ we denote the adjoint with $A^*$, the range
by $\mathrm{Ran}(A)$ and the null space by $\mathrm{Ker}(A)$. $A$ is called compact if the image
of the open unit ball of $H$ under $A$ is pre-compact (totally bounded) in $H'$. If
$\mathrm{Ran}(A)$ is finite dimensional then $A$ is compact; finite linear combinations of
compact linear maps and products with bounded linear maps are compact. A
linear map $A : H \to H$ is called an operator and self-adjoint if $A^* = A$ and
nonnegative (or positive) if it is self-adjoint and $\langle Ax, x \rangle \ge 0$ (or $\langle Ax, x \rangle > 0$)
for all $x \in H$, $x \ne 0$, in which case we write $A \succeq 0$ (or $A \succ 0$). We use $\preceq$ to
denote the order induced by the cone of nonnegative operators.

For linear $A : H \to H'$ and $B : H' \to H''$ the product $BA : H \to H''$ is
defined by $(BA)x = B(Ax)$. Then $A^*A : H \to H$ is always a nonnegative
operator. We use $\|\cdot\|_\infty$ for the norm $\|A\|_\infty = \sup\{\|Ax\| : \|x\| \le 1\}$. We
generally assume $\|A\|_\infty < \infty$.

If $A$ is a compact and self-adjoint operator then there exists an orthonormal
basis $e_i$ of $H$ and real numbers $\lambda_i$ satisfying $|\lambda_i| \to 0$ such that

\[ A = \sum_i \lambda_i Q_{e_i}, \]

where $Q_{e_i}$ is the operator defined by $Q_{e_i} x = \langle x, e_i \rangle e_i$. The $e_i$ are eigenvectors
and the $\lambda_i$ eigenvalues of $A$. If $f$ is a real function defined on a set containing
all the $\lambda_i$ a self-adjoint operator $f(A)$ is defined by

\[ f(A) = \sum_i f(\lambda_i) Q_{e_i}. \]

$f(A)$ has the same eigenvectors as $A$ and eigenvalues $f(\lambda_i)$. In the sequel
self-adjoint operators are assumed to be either compact or of the form $f(A)$
with $A$ compact (we will encounter no others), so that there always exists a
basis of eigenvectors. A self-adjoint operator is nonnegative (positive) if all its
eigenvalues are nonnegative (positive). If $A$ is positive then $\ln(A)$ exists and has
the property $\ln(A) \preceq \ln(B)$ whenever $B$ is positive and $A \preceq B$. This property
of operator monotonicity will be tacitly used in the sequel.

We write $\lambda_{\max}(A)$ for the largest eigenvalue (if it exists), and for nonnegative
operators $\lambda_{\max}(\cdot)$ always exists and coincides with the norm $\|\cdot\|_\infty$.

A linear subspace $M \subseteq H$ is called invariant under $A$ if $AM \subseteq M$. For
a linear subspace $M \subseteq H$ we use $M^\perp$ to denote the orthogonal complement
$M^\perp = \{x \in H : \langle x, y \rangle = 0, \forall y \in M\}$. For a self-adjoint operator $\mathrm{Ran}(A)^\perp =
\mathrm{Ker}(A)$. For a self-adjoint operator $A$ on $H$ and an invariant subspace $M$ of $A$
the trace $\operatorname{tr}_M A$ of $A$ relative to $M$ is defined

\[ \operatorname{tr}_M A = \sum_i \langle A e_i, e_i \rangle, \]

where $\{e_i\}$ is an orthonormal basis of $M$. The choice of basis does not affect the
value of $\operatorname{tr}_M$. For $M = H$ we just write $\operatorname{tr}$ without subscript. The trace-norm
of any linear map from $H$ to any Hilbert space is defined as

\[ \|A\|_1 = \operatorname{tr}\left((A^*A)^{1/2}\right). \]

If $\|A\|_1 < \infty$ then $A$ is compact. If $A$ is an operator and $A \succeq 0$ then $\|A\|_1$ is
simply the sum of eigenvalues of $A$. In the sequel we will use Hölder's inequality
[10] for linear maps in the following form.

Theorem 2 Let $A$ and $B$ be two linear maps $H \to \mathbb{R}^T$. Then $|\operatorname{tr}(A^*B)| \le \|A\|_1 \|B\|_\infty$.
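
The two norms and the inequality of Theorem 2 can be checked numerically in the finite dimensional case. The sketch below assumes $H = \mathbb{R}^d$, represents linear maps $H \to \mathbb{R}^T$ as $T \times d$ matrices, and uses hypothetical helper names (`trace_norm`, `operator_norm`); it illustrates the definitions and is not a proof.

```python
import numpy as np


def trace_norm(A):
    """Sum of the singular values of A, i.e. tr((A* A)^(1/2))."""
    return np.linalg.svd(A, compute_uv=False).sum()


def operator_norm(A):
    """Largest singular value of A."""
    return np.linalg.norm(A, 2)


rng = np.random.default_rng(0)
d, T = 6, 4
A = rng.standard_normal((T, d))   # matrix of a linear map R^d -> R^T
B = rng.standard_normal((T, d))

# Trace norm straight from the definition, via the eigenvalues of A* A.
eigenvalues = np.clip(np.linalg.eigvalsh(A.T @ A), 0.0, None)
trace_norm_from_definition = np.sqrt(eigenvalues).sum()

lhs = abs(np.trace(A.T @ B))              # |tr(A* B)|
rhs = trace_norm(A) * operator_norm(B)    # ||A||_1 ||B||_inf
```

The value `lhs` should not exceed `rhs`, and the two ways of computing the trace norm should agree up to rounding error.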

Rank-1 operators and covariance operators. For $w \in H$ we define an
operator $Q_w$ by

\[ Q_w v = \langle v, w \rangle w, \quad \text{for } v \in H. \]

In matrix notation this would be the matrix $ww^*$. It can also be written as the
tensor product $w \otimes w$. We apologize for the unusual notation $Q_w$, but it will
save space in many of the formulas below. The covariance operators in Theorem
1 are then given by

\[ C = \frac{1}{T} \sum_t \mathbb{E}_{(X,Y) \sim \mu_t} Q_X \quad \text{and} \quad \hat C = \frac{1}{nT} \sum_{t,i} Q_{X_i^t}. \]

Here and in the sequel the Rademacher variables $\sigma_i^t$ (or sometimes $\sigma_i$) are
uniformly distributed on $\{-1, 1\}$, mutually independent and independent of all other
random variables, and $\mathbb{E}_\sigma$ is the expectation conditional on all other random
variables present. We conclude this section with two lemmata. Two numbers
$p, q > 1$ are called conjugate exponents if $1/p + 1/q = 1$.

Lemma 3 (i) Let $p, q$ be conjugate exponents and $s, a > 0$. Then $\left(\sqrt{s + pa} - \sqrt{a}\right)^2 \ge s/q$.
(ii) For $a, b > 0$

\[ \min_{p, q > 1,\ 1/p + 1/q = 1} \sqrt{pa + qb} = \sqrt{a} + \sqrt{b}. \]

(iii) For $a, b \ge 0$ we have $2\sqrt{ab} \le (p - 1)a + (q - 1)b$.

Proof. For conjugate exponents $p$ and $q$ we have $p - 1 = p/q$ and $q - 1 = q/p$.
Therefore $pa + qb - \left(\sqrt{a} + \sqrt{b}\right)^2 = \left(\sqrt{pa/q} - \sqrt{qb/p}\right)^2 \ge 0$, which proves (iii)
and gives $\sqrt{pa + qb} \ge \sqrt{a} + \sqrt{b}$. Take $s = qb$, subtract $\sqrt{a}$ and square to get (i).
Set $p = 1 + \sqrt{b/a}$ and $q = 1 + \sqrt{a/b}$ to get (ii). ■
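
A small numerical check of Lemma 3 (ii) and (iii) may help to see how the optimal conjugate exponents arise. The sketch below uses the arbitrary illustrative values $a = 2$ and $b = 1/2$ and a grid search over $p$; the function names are hypothetical and the grid is only an approximation of the minimization.

```python
import numpy as np


def conjugate(p):
    """Conjugate exponent q of p, defined by 1/p + 1/q = 1."""
    return p / (p - 1.0)


def lemma3_objective(p, a, b):
    """The quantity sqrt(p a + q b) minimized in Lemma 3 (ii)."""
    return np.sqrt(p * a + conjugate(p) * b)


a, b = 2.0, 0.5
p_star = 1.0 + np.sqrt(b / a)              # minimizer given in the proof
closed_form = np.sqrt(a) + np.sqrt(b)      # claimed minimum value

# Grid search over p > 1 as an independent approximation of the minimum.
grid = np.linspace(1.01, 20.0, 20001)
grid_min = lemma3_objective(grid, a, b).min()

# Lemma 3 (iii) on the same grid: 2 sqrt(ab) <= (p - 1) a + (q - 1) b.
part_iii_holds = np.all(
    2.0 * np.sqrt(a * b) <= (grid - 1.0) * a + (conjugate(grid) - 1.0) * b + 1e-12
)
```

The grid minimum can only approach `closed_form` from above, and it is attained at `p_star`.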

Lemma 4 Let $a, c > 0$, $b \ge 1$ and suppose the real random variable $X \ge 0$
satisfies $\Pr\{X > pa + s\} \le b\exp(-s/(cq))$ for all $s \ge 0$ and all conjugate
exponents $p$ and $q$. Then

\[ \sqrt{\mathbb{E}X} \le \sqrt{a} + \sqrt{c(\ln b + 1)}. \]

Proof. We use partial integration.

\begin{align*}
\mathbb{E}X &\le pa + qc\ln b + \int_{qc\ln b}^{\infty} \Pr\{X > pa + s\}\, ds \\
&\le pa + qc\ln b + b\int_{qc\ln b}^{\infty} e^{-s/(cq)}\, ds = pa + qc(\ln b + 1).
\end{align*}

Take the square root of both sides and use Lemma 3 (ii) to optimize in $p$ and $q$
to obtain the conclusion. ■



4 Sums of random operators 

In this section we prove two concentration results for sums of nonnegative
operators with finite dimensional ranges. The first (Theorem 7) assumes only a
weak form of boundedness, but it is strongly dimension dependent. The second
result (Theorem 8) is the opposite. We will use the following important result
of Tropp (Lemma 3.4 in [27]), derived from Lieb's concavity theorem (see [10],
Section IX.6):

Theorem 5 Consider a finite sequence $A_k$ of independent, random, self-adjoint
operators and a finite dimensional subspace $M \subseteq H$ such that $A_k M \subseteq M$. Then
for $\theta \in \mathbb{R}$

\[ \mathbb{E} \operatorname{tr}_M \exp\Big(\theta \sum_k A_k\Big) \le \operatorname{tr}_M \exp\Big(\sum_k \ln \mathbb{E} e^{\theta A_k}\Big). \]

A corollary suited to our applications is the following



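Theorem 6 Let $A_1, \dots, A_N$ be independent, random, self-adjoint operators
on $H$ and let $M \subseteq H$ be a nontrivial, finite dimensional subspace such that
$\mathrm{Ran}(A_k) \subseteq M$ a.s. for all $k$.
(i) If $A_k \succeq 0$ a.s. then

\[ \mathbb{E} \exp\Big(\Big\|\sum_k A_k\Big\|_\infty\Big) \le \dim(M) \exp\Big(\lambda_{\max}\Big(\sum_k \ln \mathbb{E} e^{A_k}\Big)\Big). \]

(ii) If the $A_k$ are symmetrically distributed then

\[ \mathbb{E} \exp\Big(\Big\|\sum_k A_k\Big\|_\infty\Big) \le 2\dim(M) \exp\Big(\lambda_{\max}\Big(\sum_k \ln \mathbb{E} e^{A_k}\Big)\Big). \]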



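Proof. Let $A = \sum_k A_k$. Observe that $M^\perp \subseteq \mathrm{Ker}(A) \cap \left(\bigcap_k \mathrm{Ker}(A_k)\right)$, and
that $M$ is a nontrivial invariant subspace for $A$ as well as for all the $A_k$.

(i) Assume $A_k \succeq 0$. Then also $A \succeq 0$. Since $M^\perp \subseteq \mathrm{Ker}(A)$ there is $x_1 \in M$
with $\|x_1\| = 1$ and $Ax_1 = \|A\|_\infty x_1$ (this also holds if $A = 0$, since $M$ is nontrivial).
Thus $e^A x_1 = e^{\|A\|_\infty} x_1$. Extending $x_1$ to a basis $\{x_i\}$ of $M$ we get

\[ e^{\|A\|_\infty} = \langle e^A x_1, x_1 \rangle \le \sum_i \langle e^A x_i, x_i \rangle = \operatorname{tr}_M e^A. \]

Theorem 5 applied to the matrices which represent $A_k$ restricted to the finite
dimensional invariant subspace $M$ then gives

\[ \mathbb{E} e^{\|A\|_\infty} \le \mathbb{E} \operatorname{tr}_M e^A \le \operatorname{tr}_M \exp\Big(\sum_k \ln \mathbb{E} e^{A_k}\Big) \le \dim(M) \exp\Big(\lambda_{\max}\Big(\sum_k \ln \mathbb{E} e^{A_k}\Big)\Big), \]

where the last inequality results from bounding $\operatorname{tr}_M$ by $\dim(M)\,\lambda_{\max}$ and
$\lambda_{\max}(\exp(\cdot)) = \exp(\lambda_{\max}(\cdot))$.

(ii) Assume that $A_k$ is symmetrically distributed. Then so is $A$. Since
$M^\perp \subseteq \mathrm{Ker}(A)$ there is $x_1 \in M$ with $\|x_1\| = 1$ and either $Ax_1 = \|A\|_\infty x_1$ or
$-Ax_1 = \|A\|_\infty x_1$, so that either $e^A x_1 = e^{\|A\|_\infty} x_1$ or $e^{-A} x_1 = e^{\|A\|_\infty} x_1$. Extending
to a basis again gives

\[ e^{\|A\|_\infty} \le \langle e^A x_1, x_1 \rangle + \langle e^{-A} x_1, x_1 \rangle \le \operatorname{tr}_M e^A + \operatorname{tr}_M e^{-A}. \]

By symmetric distribution we have

\[ \mathbb{E} e^{\|A\|_\infty} \le \operatorname{tr}_M\left(\mathbb{E} e^A + \mathbb{E} e^{-A}\right) = 2\,\mathbb{E} \operatorname{tr}_M e^A. \]

Then continue as in case (i). ■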

The following is our first technical tool. 

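Theorem 7 Let $M \subseteq H$ be a subspace of dimension $d$ and suppose that $A_1, \dots, A_N$
are independent random operators satisfying $A_k \succeq 0$, $\mathrm{Ran}(A_k) \subseteq M$ a.s. and

\[ \mathbb{E} A_k^m \preceq m!\, R^{m-1}\, \mathbb{E} A_k \tag{4} \]

for some $R > 0$, all $m \in \mathbb{N}$ and all $k \in \{1, \dots, N\}$. Then for $s \ge 0$ and
conjugate exponents $p$ and $q$

\[ \Pr\Big\{ \Big\|\sum_k A_k\Big\|_\infty > p\,\Big\|\sum_k \mathbb{E} A_k\Big\|_\infty + s \Big\} \le \dim(M)\, e^{-s/(qR)}. \]

Also

\[ \sqrt{\mathbb{E}\Big\|\sum_k A_k\Big\|_\infty} \le \sqrt{\Big\|\sum_k \mathbb{E} A_k\Big\|_\infty} + \sqrt{R\left(\ln\dim(M) + 1\right)}. \]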
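Proof. Let $\theta$ be any number satisfying $0 \le \theta < 1/R$. From (4) we get for any
$k \in \{1, \dots, N\}$

\[ \mathbb{E} e^{\theta A_k} = I + \sum_{m=1}^{\infty} \frac{\theta^m}{m!} \mathbb{E} A_k^m \preceq I + \sum_{m=1}^{\infty} \theta^m R^{m-1}\, \mathbb{E} A_k = I + \frac{\theta}{1 - R\theta}\, \mathbb{E} A_k \preceq \exp\Big(\frac{\theta}{1 - R\theta}\, \mathbb{E} A_k\Big). \]

Abbreviate $\mu = \left\|\sum_k \mathbb{E} A_k\right\|_\infty$ and let $r = s + p\mu$ and set

\[ \theta = \frac{1}{R}\Big(1 - \sqrt{\mu/r}\Big), \]

so that $0 \le \theta < 1/R$. Applying the above inequality and the operator
monotonicity of the logarithm we get for all $k$ that $\ln \mathbb{E} \exp(\theta A_k) \preceq \theta/(1 - R\theta)\, \mathbb{E} A_k$.
Summing this relation over $k$ and passing to the largest eigenvalue yields

\[ \lambda_{\max}\Big(\sum_k \ln \mathbb{E} e^{\theta A_k}\Big) \le \frac{\theta\mu}{1 - R\theta}. \]

Now we combine Markov's inequality with Theorem 6 (i) and the last inequality
to obtain

\begin{align*}
\Pr\Big\{ \Big\|\sum_k A_k\Big\|_\infty > r \Big\} &\le e^{-\theta r}\, \mathbb{E} \exp\Big(\theta \Big\|\sum_k A_k\Big\|_\infty\Big) \\
&\le \dim(M)\, e^{-\theta r} \exp\Big(\lambda_{\max}\Big(\sum_k \ln \mathbb{E} e^{\theta A_k}\Big)\Big) \\
&\le \dim(M) \exp\Big(-\theta r + \frac{\theta\mu}{1 - R\theta}\Big) \\
&= \dim(M) \exp\Big(-\frac{(\sqrt{r} - \sqrt{\mu})^2}{R}\Big).
\end{align*}

By Lemma 3 (i) $(\sqrt{r} - \sqrt{\mu})^2 = \left(\sqrt{s + p\mu} - \sqrt{\mu}\right)^2 \ge s/q$, so this proves the first
conclusion. The second follows from the first and Lemma 4. ■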

The next result and its proof are essentially due to Oliveira ([24], Lemma 1,
but see also [23]). We give a slightly more general version which eliminates the
assumption of identical distribution and has smaller constants.

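Theorem 8 Let $A_1, \dots, A_N$ be independent random operators satisfying
$0 \preceq A_k \preceq I$ and suppose that for some $d \in \mathbb{N}$

\[ \dim \mathrm{Span}\left(\mathrm{Ran}(A_1), \dots, \mathrm{Ran}(A_N)\right) \le d \tag{5} \]

almost surely. Then

(i)
\[ \Pr\Big\{ \Big\|\sum_k (A_k - \mathbb{E} A_k)\Big\|_\infty > s \Big\} \le 4d^2 \exp\Big(\frac{-s^2}{9\left\|\sum_k \mathbb{E} A_k\right\|_\infty + 6s}\Big) \]

(ii)
\[ \Pr\Big\{ \Big\|\sum_k A_k\Big\|_\infty > p\,\Big\|\sum_k \mathbb{E} A_k\Big\|_\infty + s \Big\} \le 4d^2 e^{-s/(6q)} \]

(iii)
\[ \sqrt{\mathbb{E}\Big\|\sum_k A_k\Big\|_\infty} \le \sqrt{\Big\|\sum_k \mathbb{E} A_k\Big\|_\infty} + \sqrt{6\left(\ln(4d^2) + 1\right)} \]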



In the previous theorem the subspace $M$ was deterministic and had to
contain the ranges of all possible random realizations of the $A_k$. By contrast the
span appearing in (5) is the random subspace spanned by a single random
realization of the $A_k$. If all the $A_k$ have rank one, for example, we can take $d = N$
and apply the present theorem even if each $\mathbb{E} A_k$ has infinite rank. This allows us to
estimate the empirical covariance in terms of the true covariance for a bounded
data distribution in an infinite dimensional space.

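Proof. Let $0 \le \theta < 1/4$ and abbreviate $A = \sum_k A_k$. A standard symmetrization
argument (see [19], Lemma 6.3) shows that

\[ \mathbb{E} e^{\theta\|A - \mathbb{E}A\|_\infty} \le \mathbb{E}\, \mathbb{E}_\sigma \exp\Big(2\theta \Big\|\sum_k \sigma_k A_k\Big\|_\infty\Big), \]

where the $\sigma_k$ are Rademacher variables and $\mathbb{E}_\sigma$ is the expectation conditional on
the $A_1, \dots, A_N$. For fixed $A_1, \dots, A_N$ let $M$ be the linear span of their ranges,
which has dimension at most $d$ and also contains the ranges of the symmetrically
distributed operators $2\theta\sigma_k A_k$. Invoking Theorem 6 (ii) we get

\begin{align*}
\mathbb{E}_\sigma \exp\Big(2\theta \Big\|\sum_k \sigma_k A_k\Big\|_\infty\Big) &\le 2d \exp\Big(\lambda_{\max}\Big(\sum_k \ln \mathbb{E}_\sigma e^{2\theta\sigma_k A_k}\Big)\Big) \\
&\le 2d \exp\Big(2\theta^2 \Big\|\sum_k A_k^2\Big\|_\infty\Big) \\
&\le 2d \exp\left(2\theta^2 \|A\|_\infty\right).
\end{align*}

The second inequality comes from $\mathbb{E}_\sigma e^{2\theta\sigma_k A_k} = \cosh(2\theta A_k) \preceq e^{2\theta^2 A_k^2}$ and the
fact that for positive operators $\lambda_{\max}$ and the norm coincide. The last inequality
follows from the implications $0 \preceq A_k \preceq I \implies A_k^2 \preceq A_k \implies \sum_k A_k^2 \preceq \sum_k A_k
\implies \left\|\sum_k A_k^2\right\|_\infty \le \|A\|_\infty$. Now we take the expectation in $A_1, \dots, A_N$. Together
with the previous inequalities we obtain

\begin{align*}
\mathbb{E} e^{\theta\|A - \mathbb{E}A\|_\infty} &\le 2d\, \mathbb{E} e^{2\theta^2\|A\|_\infty} \le 2d\, \mathbb{E} e^{2\theta^2\|A - \mathbb{E}A\|_\infty} e^{2\theta^2\|\mathbb{E}A\|_\infty} \\
&\le 2d \left(\mathbb{E} e^{\theta\|A - \mathbb{E}A\|_\infty}\right)^{2\theta} e^{2\theta^2\|\mathbb{E}A\|_\infty}.
\end{align*}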
The last inequality holds by Jensen's inequality since $\theta < 1/4 < 1/2$. Dividing
by $\left(\mathbb{E}\exp(\theta\|A - \mathbb{E}A\|_\infty)\right)^{2\theta}$, taking the power of $1/(1 - 2\theta)$ and multiplying with
$e^{-\theta s}$ gives

\[ \Pr\{\|A - \mathbb{E}A\|_\infty > s\} \le e^{-\theta s}\, \mathbb{E} e^{\theta\|A - \mathbb{E}A\|_\infty} \le (2d)^{1/(1 - 2\theta)} \exp\Big(\frac{2\theta^2}{1 - 2\theta} \|\mathbb{E}A\|_\infty - \theta s\Big). \]

Since $\theta < 1/4$, we have $(2d)^{1/(1 - 2\theta)} < (2d)^2$. Substitution of $\theta = s/(6\|\mathbb{E}A\|_\infty + 4s) <
1/4$ together with some simplifications gives (i).

It follows from elementary algebra that for $\delta > 0$ with probability at least
$1 - \delta$ we have

\begin{align*}
\|A\|_\infty &\le \|\mathbb{E}A\|_\infty + 2\sqrt{\|\mathbb{E}A\|_\infty\, (9/4) \ln(4d^2/\delta)} + 6\ln(4d^2/\delta) \\
&\le p\,\|\mathbb{E}A\|_\infty + 6q\ln(4d^2/\delta),
\end{align*}

where the last line follows from $(9/4) < 6$ and Lemma 3 (iii). Equating the
second term in the last line to $s$ and solving for the probability $\delta$ we obtain (ii),
and (iii) follows from Lemma 4. ■



5 Proof of Theorem 1

We prove the excess risk bound for heterogeneous sample sizes with the weighted
trace norm as in Remark 7 following the statement of Theorem 1. The sample
size for the $t$-th task is thus $n_t$ and we abbreviate $\bar n$ for the average sample
size, $\bar n = (1/T)\sum_t n_t$, so that $\bar n T$ is the total number of examples. The class of
linear maps $\mathcal{W}$ considered is

\[ \mathcal{W} = \left\{ W \in \mathbb{R}^{d \times T} : \|SW\|_1 \le B\sqrt{T} \right\} \]

with $S = \operatorname{diag}(s_1, \dots, s_T)$ and $s_t = \sqrt{\bar n / n_t}$. With $\mathcal{W}$ so defined we will prove the
inequalities in Theorem 1 with $n$ replaced by $\bar n$. The result then reduces to
Theorem 1 if all the sample sizes are equal.

The first steps in the proof follow a standard pattern. We write

\begin{align*}
R(\hat W) - R(W^*) = {} & \left[ R(\hat W) - \hat R(\hat W, Z) \right] + \left[ \hat R(\hat W, Z) - \hat R(W^*, Z) \right] \\
& + \left[ \hat R(W^*, Z) - R(W^*) \right].
\end{align*}

The second term is never positive by the definition of $\hat W$. The third term
depends only on $W^*$. Using Hoeffding's inequality [16] it can be bounded with
probability at least $1 - \delta$ by $\sqrt{\ln(1/\delta)/(2\bar n T)}$. There remains the first term
which we bound by

\[ \sup_{W \in \mathcal{W}} R(W) - \hat R(W, Z). \]
It has by now become a standard technique (see [6]) to show that this quantity
is with probability at least $1 - \delta$ bounded by

\[ \mathbb{E}_Z \mathcal{R}(\mathcal{W}, Z) + \sqrt{\frac{\ln(1/\delta)}{2\bar n T}} \tag{6} \]

or

\[ \mathcal{R}(\mathcal{W}, Z) + \sqrt{\frac{9\ln(2/\delta)}{2\bar n T}}, \tag{7} \]



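where the empirical Rademacher complexity $\mathcal{R}(\mathcal{W}, Z)$ is defined for a
multisample $Z$ with values in $(H \times \mathbb{R})^{\bar n T}$ by

\[ \mathcal{R}(\mathcal{W}, Z) = \frac{2}{T}\, \mathbb{E}_\sigma \sup_{W \in \mathcal{W}} \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t} \sigma_i^t\, \ell(\langle w_t, X_i^t \rangle, Y_i^t). \]

Standard results on Rademacher averages allow us to eliminate the Lipschitz
loss functions and give us

\begin{align*}
\mathcal{R}(\mathcal{W}, Z) &\le \frac{2L}{T}\, \mathbb{E}_\sigma \sup_{W \in \mathcal{W}} \sum_{t=1}^{T} \sum_{i=1}^{n_t} \sigma_i^t \langle w_t, X_i^t / n_t \rangle \\
&= \frac{2L}{T}\, \mathbb{E}_\sigma \sup_{W \in \mathcal{W}} \operatorname{tr}(W^*D) = \frac{2L}{T}\, \mathbb{E}_\sigma \sup_{W \in \mathcal{W}} \operatorname{tr}(W^* S S^{-1} D),
\end{align*}

where the random operator $D : H \to \mathbb{R}^T$ is defined for $v \in H$ by $(Dv)_t =
\langle v, \sum_{i=1}^{n_t} \sigma_i^t X_i^t / n_t \rangle$, and the diagonal matrix $S$ is as above. Hölder's and Jensen's
inequalities give

\[ \mathcal{R}(\mathcal{W}, Z) \le \frac{2L}{T}\, \mathbb{E}_\sigma \sup_{W \in \mathcal{W}} \|SW\|_1 \|S^{-1}D\|_\infty \le \frac{2LB}{\sqrt{T}}\, \mathbb{E}_\sigma \|S^{-1}D\|_\infty \le \frac{2LB}{\sqrt{T}} \sqrt{\mathbb{E}_\sigma \|D^* S^{-2} D\|_\infty}. \]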



Let $V_t$ be the random vector $V_t = \sum_{i=1}^{n_t} \sigma_i^t X_i^t / (s_t n_t)$ and recall that the induced
rank-one operator $Q_{V_t}$ is defined by $Q_{V_t} v = \langle v, V_t \rangle V_t = (1/(\bar n n_t)) \sum_{i,j} \langle v, \sigma_i^t X_i^t \rangle\, \sigma_j^t X_j^t$.
Then $D^* S^{-2} D = \sum_t Q_{V_t}$, so we obtain

\[ \mathcal{R}(\mathcal{W}, Z) \le \frac{2LB}{\sqrt{T}} \sqrt{\mathbb{E}_\sigma \Big\|\sum_t Q_{V_t}\Big\|_\infty} \]

as the central object which needs to be bounded.

Observe that the range of any $Q_{V_t}$ lies in the subspace

\[ M = \mathrm{Span}\{X_i^t : 1 \le t \le T \text{ and } 1 \le i \le n_t\} \]

which has dimension $\dim M \le \bar n T < \infty$. We can therefore pull the expectation
inside the norm using Theorem 7 if we can verify a subexponential bound (4)
on the moments of the $Q_{V_t}$. This is the content of the following lemma.
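Lemma 9 Let $x_1, \dots, x_n$ be in $H$ and satisfy $\|x_i\| \le b$. Define a random vector
by $V = \sum_i \sigma_i x_i$. Then for $m \ge 1$

\[ \mathbb{E}\left[(Q_V)^m\right] \preceq m!\, (2nb^2)^{m-1}\, \mathbb{E}[Q_V]. \]

Proof. Let $K_{m,n}$ be the set of all sequences $(j_1, \dots, j_{2m})$ with $j_k \in \{1, \dots, n\}$,
such that each integer in $\{1, \dots, n\}$ occurs an even number of times. It is easily
shown by induction that the number of sequences in $K_{m,n}$ is bounded by

\[ |K_{m,n}| \le (2m - 1)!!\, n^m, \]

where $(2m - 1)!! = \prod_{i=1}^{m} (2i - 1) \le m!\, 2^{m-1}$.

Now let $v \in H$ be arbitrary. By the definition of $V$ and $Q_V$ we have that

\[ \langle \mathbb{E}[(Q_V)^m] v, v \rangle = \sum_{j_1, \dots, j_{2m} = 1}^{n} \mathbb{E}\left[\sigma_{j_1} \sigma_{j_2} \cdots \sigma_{j_{2m}}\right] \langle v, x_{j_1} \rangle \langle x_{j_2}, x_{j_3} \rangle \cdots \langle x_{j_{2m}}, v \rangle. \]

The properties of independent Rademacher variables imply that $\mathbb{E}[\sigma_{j_1} \sigma_{j_2} \cdots \sigma_{j_{2m}}] =
1$ if $j \in K_{m,n}$ and zero otherwise. For $m = 1$ this shows $\langle \mathbb{E}[(Q_V)^m] v, v \rangle =
\langle \mathbb{E}[Q_V] v, v \rangle = \sum_j \langle v, x_j \rangle^2$. For $m > 1$, since $\|x_j\| \le b$ and by two applications
of the Cauchy-Schwarz inequality

\begin{align*}
\langle \mathbb{E}[(Q_V)^m] v, v \rangle &= \sum_{j \in K_{m,n}} \langle v, x_{j_1} \rangle \langle x_{j_2}, x_{j_3} \rangle \cdots \langle x_{j_{2m}}, v \rangle \\
&\le b^{2(m-1)} \sum_{j \in K_{m,n}} |\langle v, x_{j_1} \rangle|\, |\langle x_{j_{2m}}, v \rangle| \\
&\le b^{2(m-1)} \Big(\sum_{j \in K_{m,n}} \langle v, x_{j_1} \rangle^2\Big)^{1/2} \Big(\sum_{j \in K_{m,n}} \langle x_{j_{2m}}, v \rangle^2\Big)^{1/2} \\
&= b^{2(m-1)} \sum_{j} \langle v, x_j \rangle^2 \sum_{\mathbf{j} \in K_{m,n} \text{ such that } j_1 = j} 1 \\
&\le \langle \mathbb{E}[Q_V] v, v \rangle\, (2m - 1)!!\, n^{m-1} b^{2(m-1)} \\
&\le m!\, (2nb^2)^{m-1} \langle \mathbb{E}[Q_V] v, v \rangle.
\end{align*}

The conclusion follows since for self-adjoint matrices $(\forall v,\ \langle Av, v \rangle \le \langle Bv, v \rangle) \implies
A \preceq B$. ■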
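
For small $n$ the moment bound of Lemma 9 can be verified exactly by enumerating all $2^n$ sign patterns. The sketch below assumes $H = \mathbb{R}^d$ with randomly generated vectors $x_i$ rescaled so that $b = \max_i \|x_i\| = 1$; the function names `moment` and `lemma9_gap` are hypothetical, and the check covers only the listed values of $m$.

```python
import itertools
from math import factorial

import numpy as np


def moment(xs, m):
    """Exact E[(Q_V)^m] for V = sum_i sigma_i x_i, averaged over all 2^n sign patterns."""
    n, d = xs.shape
    total = np.zeros((d, d))
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        v = np.asarray(signs) @ xs      # one realization of V
        q = np.outer(v, v)              # the rank-one operator Q_V as a matrix
        total += np.linalg.matrix_power(q, m)
    return total / 2 ** n


def lemma9_gap(xs, m):
    """Smallest eigenvalue of m! (2 n b^2)^(m-1) E[Q_V] - E[(Q_V)^m]."""
    n = xs.shape[0]
    b = np.linalg.norm(xs, axis=1).max()
    upper = factorial(m) * (2.0 * n * b ** 2) ** (m - 1) * moment(xs, 1)
    return np.linalg.eigvalsh(upper - moment(xs, m)).min()


rng = np.random.default_rng(1)
xs = rng.standard_normal((5, 3))
xs /= np.linalg.norm(xs, axis=1).max()   # rescale so that max_i ||x_i|| = 1
gaps = [lemma9_gap(xs, m) for m in (1, 2, 3, 4)]
```

Each entry of `gaps` should be nonnegative up to rounding error, which is the operator inequality of the lemma for that $m$.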

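If we apply this lemma to the vectors $V_t$ defined above with $b = 1/(s_t n_t)$,
using $s_t^2 n_t = \bar n$, we obtain

\[ \mathbb{E}\left[(Q_{V_t})^m\right] \preceq m! \left(\frac{2n_t}{s_t^2 n_t^2}\right)^{m-1} \mathbb{E}[Q_{V_t}] = m! \left(\frac{2}{\bar n}\right)^{m-1} \mathbb{E}[Q_{V_t}]. \]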
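Applying the last conclusion of Theorem 7 with $R = 2/\bar n$ and $d = \bar n T$ now yields

\[ \sqrt{\mathbb{E}_\sigma \Big\|\sum_t Q_{V_t}\Big\|_\infty} \le \sqrt{\Big\|\sum_t \mathbb{E}_\sigma Q_{V_t}\Big\|_\infty} + \sqrt{\frac{2(\ln(\bar n T) + 1)}{\bar n}}, \]

and since $\sum_t \mathbb{E}_\sigma Q_{V_t} = \sum_t (1/\bar n) \sum_i (1/n_t)\, Q_{X_i^t} = T\hat C/\bar n$ we get

\begin{align*}
\mathcal{R}(\mathcal{W}, Z) &\le \frac{2LB}{\sqrt{T}} \sqrt{\mathbb{E}_\sigma \Big\|\sum_t Q_{V_t}\Big\|_\infty} \\
&\le 2LB \left( \sqrt{\frac{\|\hat C\|_\infty}{\bar n}} + \sqrt{\frac{2(\ln(\bar n T) + 1)}{\bar n T}} \right). \tag{8}
\end{align*}

Together with (7) and the initial remarks in this section this proves the second
part of Theorem 1.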

To obtain the first assertion we take the expectation of (8) and use Jensen's
inequality, which then confronts us with the problem of bounding $\mathbb{E}\|\hat C\|_\infty$ in
terms of $\|C\|_\infty = \|\mathbb{E}\hat C\|_\infty$. Note that $\bar n T \hat C = \sum_t \sum_i Q_{X_i^t}$. Here Theorem 7
doesn't help because the covariance may have infinite rank, so that we cannot
find a finite dimensional subspace containing the ranges of all the $Q_{X_i^t}$. But
since $\|X_i^t\| \le 1$ all the $Q_{X_i^t}$ satisfy $0 \preceq Q_{X_i^t} \preceq I$ and are rank-one operators, we
can invoke Theorem 8 with $d = \bar n T$. This gives



\[ \sqrt{\mathbb{E}\|\hat C\|_\infty} \le \sqrt{\|C\|_\infty} + \sqrt{\frac{6(\ln(4\bar n T) + 1)}{\bar n T}} \]



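and from (8) and Jensen's inequality and some simplifications we obtain

\begin{align*}
\mathbb{E}\mathcal{R}(\mathcal{W}, Z) &\le 2LB \left( \sqrt{\frac{\mathbb{E}\|\hat C\|_\infty}{\bar n}} + \sqrt{\frac{2(\ln(\bar n T) + 1)}{\bar n T}} \right) \\
&\le 2LB \left( \sqrt{\frac{\|C\|_\infty}{\bar n}} + 5\sqrt{\frac{\ln(\bar n T) + 1}{\bar n T}} \right),
\end{align*}

which, together with (6), gives the first assertion of Theorem 1.

A similar application of Theorem 8 applied to the bound (2) in [18] yields
the bound (3).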